Cross-Modal Distillation for Speaker Recognition
Authors
Abstract
Speaker recognition has achieved great progress recently; however, it is not easy or efficient to further improve its performance via the traditional solutions of collecting more data and designing new neural networks. Aiming at the fundamental challenge of speech data, i.e., low information density, multimodal learning can mitigate this by introducing richer and more discriminative information as input for identity recognition. Specifically, since face images are more discriminative than speech for identity recognition, we conduct cross-modal knowledge distillation: a face recognition model (teacher) transfers its knowledge to a speaker recognition model (student) during training. However, such distillation is non-trivial, because the big domain gap between the two modalities can easily lead to overfitting. In this work, we introduce a cross-modal distillation framework, VGSR (Vision-Guided Speaker Recognition). We propose an MKD (Margin-based Knowledge Distillation) strategy for cross-modality distillation that uses a loose constraint to align the teacher and the student, greatly reducing overfitting. Our MKD strategy can easily adapt to various existing speaker recognition methods. In addition, we propose a QAW (Quality-based Adaptive Weights) module that weights training samples by quantified quality, leading to more robust training. Experimental results on the VoxCeleb1 and CN-Celeb datasets show that our proposed strategies effectively improve accuracy by a margin of 10% ∼ 15%, and that our methods are robust to various noises.
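The abstract does not give the MKD loss formula. A minimal sketch of one plausible margin-based distillation loss, assuming cosine distance between embeddings and a hinge at the margin (the function name `margin_kd_loss` and the margin value are illustrative assumptions, not the paper's definition):

```python
import math

def cosine_distance(a, b):
    """Cosine distance between two embedding vectors (plain lists)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return 1.0 - dot / (na * nb)

def margin_kd_loss(student_embs, teacher_embs, margin=0.2):
    # Loose constraint: penalize a (speaker, face) embedding pair only when
    # its distance exceeds the margin, rather than forcing an exact match.
    # This slack is what would reduce overfitting across the domain gap.
    losses = [max(0.0, cosine_distance(s, t) - margin)
              for s, t in zip(student_embs, teacher_embs)]
    return sum(losses) / len(losses)
```

With identical embeddings the hinge is inactive and the loss is zero; the teacher only pulls on students that drift outside the margin.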
Similar Resources
Cross-Modal Supervision for Learning Active Speaker Detection in Video
In this paper, we show how to use audio to supervise the learning of active speaker detection in video. Voice Activity Detection (VAD) guides the learning of the vision-based classifier in a weakly supervised manner. The classifier uses spatio-temporal features to encode upper body motion, facial expressions, and gesticulations associated with speaking. We further improve a generic model for acti...
Generalized Distillation Framework for Speaker Normalization
Generalized distillation framework has been shown to be effective in speech enhancement in the past. We extend this idea to speaker normalization without any explicit adaptation data in this paper. In the generalized distillation framework, we assume the presence of some “privileged” information to guide the training process in addition to the training data. In the proposed approach, the privil...
Audiovisual speaker identity verification based on cross modal fusion
In this paper, we propose the fusion of audio and explicit correlation features for speaker identity verification applications. Experiments performed with the GMM-based speaker models with a hybrid fusion technique, involving late fusion of explicit cross-modal fusion features with eigen-lip and audio MFCC features, allow a considerable improvement in EER performance. An evaluation of the system pe...
Cross-Modal Object Recognition Is Viewpoint-Independent
BACKGROUND Previous research suggests that visual and haptic object recognition are viewpoint-dependent both within- and cross-modally. However, this conclusion may not be generally valid as it was reached using objects oriented along their extended y-axis, resulting in differential surface processing in vision and touch. In the present study, we removed this differential by presenting objects ...
Journal
Journal title: Proceedings of the ... AAAI Conference on Artificial Intelligence
Year: 2023
ISSN: 2159-5399, 2374-3468
DOI: https://doi.org/10.1609/aaai.v37i11.26525